Domain Adaptation for CRF-based Chinese Word Segmentation using Free Annotations
Abstract
Supervised methods have been the dominant approach to Chinese word segmentation, but their performance can drop significantly when the test domain differs from the training domain. In this paper, we study the problem of obtaining partial annotation from freely available data to help Chinese word segmentation across different domains. Different sources of free annotations are transformed into a unified form of partial annotation, and a variant CRF model is used to leverage both fully and partially annotated data consistently. Experimental results show that the Chinese word segmentation model benefits from free partially annotated data. On the SIGHAN Bakeoff 2010 data, we achieve results that are competitive with the best reported in the literature.
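The idea of a unified partial annotation can be illustrated with a small sketch. It is not taken from the paper: the B/M/E/S tag set, the function names, and the marginal-likelihood objective (summing over all tag paths compatible with the constraints) are assumptions chosen as one common way to train a CRF on partially annotated sequences. A free annotation, such as a span known to be a single word, is turned into a set of allowed tags per character; a fully annotated sentence is the special case where every set has exactly one tag.

```python
import math

TAGS = ("B", "M", "E", "S")  # assumed tag set: begin, middle, end, single


def span_to_constraints(n, start, end):
    """Allowed-tag sets for n characters, given that chars [start, end) form one word."""
    allowed = [set(TAGS) for _ in range(n)]
    if end - start == 1:
        allowed[start] = {"S"}
    else:
        allowed[start] = {"B"}
        for i in range(start + 1, end - 1):
            allowed[i] = {"M"}
        allowed[end - 1] = {"E"}
    # Neighbours of the span must end / start a word; everything else stays free.
    if start > 0:
        allowed[start - 1] &= {"E", "S"}
    if end < n:
        allowed[end] &= {"B", "S"}
    return allowed


def log_sum_exp(xs):
    m = max(xs)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(x - m) for x in xs))


def constrained_log_z(emit, trans, allowed):
    """Forward algorithm restricted to tag paths that respect `allowed`.

    emit[i][t] is the score of tag t at position i; trans[(s, t)] is the
    score of moving from tag s to tag t (both hypothetical inputs here).
    """
    alpha = {t: emit[0][t] for t in allowed[0]}
    for i in range(1, len(emit)):
        alpha = {
            t: emit[i][t] + log_sum_exp([a + trans[(s, t)] for s, a in alpha.items()])
            for t in allowed[i]
        }
    return log_sum_exp(list(alpha.values()))


def partial_log_likelihood(emit, trans, allowed):
    """Log-probability mass of all tag paths compatible with the partial annotation."""
    unconstrained = [set(TAGS) for _ in emit]
    return constrained_log_z(emit, trans, allowed) - constrained_log_z(emit, trans, unconstrained)
```

For example, with four characters and a known two-character word at positions 1 and 2, `span_to_constraints(4, 1, 3)` fixes those two characters to B and E, restricts the first character to E or S and the last to B or S, and leaves the rest to the model. The sketch does not forbid structurally invalid transitions such as B followed by B; a fuller version could assign them a score of negative infinity in `trans`.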
Similar resources
Adapting Conventional Chinese Word Segmenter for Segmenting Micro-blog Text: Combining Rule-based and Statistic-based Approaches
We describe two adaptation strategies used in our word segmentation system for the Microblog word segmentation bake-off. Domain-invariant information is extracted from the in-domain unlabelled corpus and incorporated as supplementary features into a conventional word segmenter based on Conditional Random Fields (CRF); we call this statistic-based adaptation. Some heuristic r...
CRF-based Experiments for Cross-Domain Chinese Word Segmentation at CIPS-SIGHAN-2010
This paper describes our experiments on the cross-domain Chinese word segmentation task at the first CIPS-SIGHAN Joint Conference on Chinese Language Processing. Our system is based on the Conditional Random Fields (CRFs) model. Considering the particular properties of the out-of-domain data, we propose some novel steps to get some improvements for the special task.
Chinese Word Segmentation based on Mixing Multiple Preprocessor and CRF
This paper describes the Chinese Word Segmenter for our participation in the CIPS-SIGHAN-2010 bake-off task of Chinese word segmentation. We formalize the tasks as sequence tagging problems and implement them using a conditional random fields (CRFs) model. The system contains two modules: multiple preprocessor and basic segmenter. The basic segmenter is designed as a problem of character-based tagg...
Voting between Dictionary-Based and Subword Tagging Models for Chinese Word Segmentation
This paper describes a Chinese word segmentation system that is based on majority voting among three models: a forward maximum matching model, a conditional random field (CRF) model using maximum subword-based tagging, and a CRF model using minimum subword-based tagging. In addition, it contains a post-processing component to deal with inconsistencies. Testing on the closed track of CityU, MSRA ...
Non-Deterministic Segmentation for Chinese Lattice Parsing
Parsing Chinese critically depends on correct word segmentation for the parser since incorrect segmentation inevitably causes incorrect parses. We investigate a pipeline approach to segmentation and parsing using word lattices as parser input. We compare CRF-based and lexicon-based approaches to word segmentation. Our results show that the lattice parser is capable of selecting the correction s...
Publication date: 2014